feat(ci): add 6.1 -> 6.18 cross snapshot testing #5856
Merged
JackThomson2 merged 11 commits into firecracker-microvm:main, Apr 24, 2026
Two bugs were preventing cross-kernel restore tests from running:
1. The glob pattern only searched one level deep under
snapshot_artifacts/, but Phase 1 artifacts are nested under an
additional test-name directory. Use recursive glob (**/) to find
snapshot directories regardless of nesting depth.
2. The "None" CPU template was only added to the search list on
x86_64, so on aarch64 instances where get_supported_cpu_templates()
returns an empty list (e.g. Neoverse N1), the loop yielded zero
pytest parameters and the test was silently skipped. Always
include "None" in the search list.
Signed-off-by: Jack Thomson <jackabt@amazon.com>
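The two fixes above can be sketched as follows. This is a minimal illustration, not the code from the PR: the function names and the `*_<template>_guest_snapshot` directory naming are assumptions made for the example.

```python
from pathlib import Path


def templates_to_search(supported):
    """Always include "None", even when no CPU templates are supported."""
    return ["None"] + [t for t in supported if t != "None"]


def find_snapshot_dirs(root, supported):
    """Yield snapshot directories at any depth under root.

    The directory naming pattern below is hypothetical.
    """
    for template in templates_to_search(supported):
        # "**/" makes the match recursive, so an extra test-name
        # directory between root and the snapshot does not hide it.
        pattern = f"**/*_{template}_guest_snapshot"
        for path in sorted(Path(root).glob(pattern)):
            if path.is_dir():
                yield path
```

With an empty list of supported templates (the aarch64 case described above), the search list is still `["None"]`, so the loop can no longer yield zero parameters just because no templates are supported.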
Add AL2023/linux_6.18 as a restore-only platform in the cross-snapshot pipeline for both x86_64 and aarch64. Snapshots created on 6.1 hosts are restored on 6.18 hosts to validate cross-kernel compatibility. The 6.18 platform is scoped to pipeline_cross.py only since 6.18 agents exist exclusively in the private Buildkite queue. Signed-off-by: Jack Thomson <jackabt@amazon.com>
guest_run_fio_iteration ran fio in the background and only checked that the process launched, not that IO actually succeeded. Run fio in the foreground with JSON output and assert that bytes were read from the block device. This addresses the TODO about verifying the root device is not corrupted after snapshot restore. Signed-off-by: Jack Thomson <jackabt@amazon.com>
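A rough sketch of the verification step described above, assuming fio was run with `--output-format=json` and that its report exposes `jobs[].read.io_bytes` and `jobs[].error`. The helper names are hypothetical and this is not the PR's implementation.

```python
import json


def total_bytes_read(fio_stdout):
    """Sum the bytes read across all jobs in a fio JSON report."""
    report = json.loads(fio_stdout)
    total = 0
    for job in report["jobs"]:
        # A non-zero per-job error code means the IO itself failed.
        assert job.get("error", 0) == 0, f"fio job failed: {job.get('jobname')}"
        total += job["read"]["io_bytes"]
    return total


def assert_fio_read_data(fio_stdout):
    """Fail unless fio actually read data from the block device."""
    read = total_bytes_read(fio_stdout)
    assert read > 0, "fio reported zero bytes read"
    return read
```

Running fio in the foreground matters here because the JSON report is only available once the job has finished.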
Check that /dev/hwrng is functional after restoring a snapshot on a different host kernel version. Signed-off-by: Jack Thomson <jackabt@amazon.com>
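One possible shape for such a check, as a sketch only: the guest command and helper names are assumptions, not taken from the PR. The guest dumps a few bytes from `/dev/hwrng` as hex and the host verifies that the expected number of bytes came back.

```python
# Hypothetical guest command: read 32 bytes and print them as hex.
HWRNG_CMD = "head -c 32 /dev/hwrng | od -An -v -tx1"


def parse_hex_dump(stdout):
    """Convert whitespace-separated hex bytes (od -tx1 style) to bytes."""
    return bytes.fromhex("".join(stdout.split()))


def assert_hwrng_output(stdout, expected_len=32):
    """Fail unless the guest returned the requested number of bytes."""
    data = parse_hex_dump(stdout)
    assert len(data) == expected_len, f"expected {expected_len} bytes, got {len(data)}"
    return data
```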
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage Diff
             main    #5856   +/-
Coverage   82.87%   82.87%
Files         276      276
Lines       29728    29728
Hits        24637    24637
Misses       5091     5091
Force-pushed from 173adb2 to 07681bc
JamesC1305 previously approved these changes on Apr 24, 2026
Manciukic reviewed on Apr 24, 2026
Manciukic reviewed on Apr 24, 2026
Force-pushed from b31460e to 826dbf5
Record guest CLOCK_MONOTONIC in phase1 just before snapshotting, then read it back after cross-kernel restore and assert the delta is small. Firecracker is supposed to resume MONOTONIC from capture time (see a1fd537 "fix(kvm-clock): do not jump monotonic clock on restore"), so the delta should be near zero regardless of how long phase1 and restore are apart in the pipeline. A large delta indicates MONOTONIC jumped forward - a kvm-clock regression that could surface only on some host-kernel combinations. Signed-off-by: Jack Thomson <jackabt@amazon.com>
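The assertion described above amounts to a bounded-delta comparison. A minimal sketch, assuming both readings are in seconds and were taken inside the guest; the function name and the 5-second threshold are illustrative, not values from the PR.

```python
def monotonic_delta(before_snapshot_s, after_restore_s, max_delta_s=5.0):
    """Return the guest CLOCK_MONOTONIC delta, failing if it looks wrong.

    max_delta_s is a hypothetical tolerance for the time the guest
    spends running between the two readings.
    """
    delta = after_restore_s - before_snapshot_s
    if delta < 0:
        raise AssertionError(f"MONOTONIC went backwards by {-delta:.3f}s")
    if delta > max_delta_s:
        raise AssertionError(f"MONOTONIC jumped forward by {delta:.3f}s")
    return delta
```

Because the clock is expected to resume from capture time, the wall-clock gap between phase1 and the restore step should not show up in the delta at all.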
Add check_network_data_integrity helper that generates random bytes on the host, pushes them to the guest via SSH command-line (base64-encoded to survive argv), has the guest decode and sha256 them, and asserts the guest-side hash matches the host-side hash. This exercises the full virtio-net RX path end-to-end beyond simple connectivity checks. Signed-off-by: Jack Thomson <jackabt@amazon.com>
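The round trip described above could look roughly like this. It is a sketch under assumptions: `run_in_guest` is a hypothetical callable that runs a shell command in the guest over SSH and returns its stdout, and the guest is assumed to have `base64` and `sha256sum` available.

```python
import base64
import hashlib
import os
import shlex


def build_guest_command(payload):
    """Build a guest command that decodes the payload and hashes it."""
    encoded = base64.b64encode(payload).decode("ascii")
    return f"echo {shlex.quote(encoded)} | base64 -d | sha256sum"


def check_payload_roundtrip(run_in_guest, size=1024):
    """Send random bytes to the guest and compare SHA-256 digests."""
    payload = os.urandom(size)
    expected = hashlib.sha256(payload).hexdigest()
    stdout = run_in_guest(build_guest_command(payload))
    # sha256sum prints "<hex digest>  -"; keep only the digest.
    actual = stdout.split()[0]
    assert actual == expected, "guest hash does not match host hash"
    return expected
```

The payload is kept small because it travels on the command line, which is also why it is base64-encoded first.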
MemoryMonitor's is_guest_mem heuristic matches a single guest-sized VMA, but _test_balloon inflates the balloon after restore, and GuestRegionMmapExt::discard_range overlays MAP_FIXED anonymous mmaps on the reclaimed ranges (a workaround specific to private file-backed mappings from snapshot restore). This fragments the 512 MiB guest VMA into ~190 smaller ones, none of which match the heuristic, so their RSS (~336 MiB) is counted as VMM overhead. This is the only cross-kernel test that inflates the balloon post-restore, and its purpose is validating cross-kernel compatibility, not VMM memory overhead, so the monitor is skipped here, as it already is in test_snapshot_phase1. Signed-off-by: Jack Thomson <jackabt@amazon.com>
The perms_aarch64 loop expects aarch64 phase1 snapshots to exist for restore steps to consume, but the snapshot-create group was x86-only, so every aarch64 restore step failed at artifact download. Add an aarch64 snapshot-create group and enable test_snapshot_phase1 on arm. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Add m8i.metal-48xl (Intel Granite Rapids), m6g.metal (Graviton2) and m8g.metal-24xl (Graviton4) to the cross-restore pipeline. These pick up same-instance cross-kernel coverage only; cross-instance restore permutations are unchanged. Signed-off-by: Jack Thomson <jackabt@amazon.com>
Previously every restore step waited for the entire snapshot-create group to finish via a pipeline-wide wait step. Each restore only needs its own source snapshot, so key each create step by instance/kv and have each restore depends_on the specific source it consumes. Restores now start as soon as their source snapshot is ready. Signed-off-by: Jack Thomson <jackabt@amazon.com>
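A sketch of the dependency wiring described above, generating Buildkite-style step dictionaries. The function names, labels, and placeholder commands are assumptions for illustration; only the `key` / `depends_on` relationship is the point.

```python
import re


def create_key(instance, kv):
    """Derive a step key from instance and kernel version.

    Characters outside [A-Za-z0-9_:-] are replaced, on the assumption
    that Buildkite step keys only allow that character set.
    """
    return re.sub(r"[^A-Za-z0-9_:-]", "-", f"snapshot-create-{instance}-{kv}")


def build_steps(creates, restores):
    """Build create steps, then restore steps that depend on their source."""
    steps = []
    for instance, kv in creates:
        steps.append({
            "label": f"create {instance} {kv}",
            "key": create_key(instance, kv),
            "command": "<create snapshot>",  # placeholder command
        })
    for (src_instance, src_kv), (dst_instance, dst_kv) in restores:
        steps.append({
            "label": f"restore {src_instance} {src_kv} -> {dst_instance} {dst_kv}",
            # Wait only for the one snapshot this restore consumes.
            "depends_on": create_key(src_instance, src_kv),
            "command": "<restore snapshot>",  # placeholder command
        })
    return steps
```

Compared with a pipeline-wide wait step, each restore becomes runnable as soon as its single dependency finishes, regardless of slower create steps elsewhere in the group.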
Force-pushed from 826dbf5 to b19c381
JamesC1305 approved these changes on Apr 24, 2026
Manciukic approved these changes on Apr 24, 2026
Changes
The cross snapshot pipeline was heavily neglected (and was actually skipping all tests), so I first spent some time fixing it up to ensure the tests run properly. I also expanded our coverage to the newly onboarded instances that had not yet been added to the pipeline.
Hardened our testing of snapshot restore: checking that the clock works as expected in the guest, adding stronger networking checks, checking the disk more thoroughly, etc.
Added the new 6.18 target, which we will use to test whether a snapshot created on 6.1 can be restored on 6.18.
Also took the opportunity to fix the ordering of the pipeline so that we no longer block on all instances completing before running the restore tests.
Link to run on my pipeline: https://buildkite.com/firecracker/jack-a-b/builds/97/steps/canvas
Link to an example negative test proving it works with incompatible kernels: https://buildkite.com/firecracker/jack-a-b/builds/94/steps/canvas?sid=019dbb56-6a1c-49d9-ab8f-cc74a38f3301&tab=output
Reason
...
License Acceptance
By submitting this pull request, I confirm that my contribution is made under
the terms of the Apache 2.0 license. For more information on following Developer
Certificate of Origin and signing off your commits, please check
CONTRIBUTING.md.
PR Checklist
tools/devtool checkbuild --all to verify that the PR passes build checks on all supported architectures.
tools/devtool checkstyle to verify that the PR passes the automated style checks.
how they are solving the problem in a clear and encompassing way.
in the PR.
CHANGELOG.md.
Runbook for Firecracker API changes.
integration tests.
TODO.
rust-vmm.